The aim of this project was to evaluate whether or not geotagged social media data can be useful in providing insight into a region’s “Sense of Place” using Santa Barbara as a case study. Sense of Place can be defined as the connection people feel to their geographic surroundings, including both the natural and built environment. Locations with a strong sense of place often have a strong identity felt by both locals and visitors.
Geotagged social media data has been used in recent years to study people’s interaction with the natural environment in various ways, many of which are focused on tourism:
This project differs in that we wanted to map the spatial patterns of tourists and locals, and understand how these two user groups engage with and perceive the natural environment of Santa Barbara.
Twitter data was obtained freely through a partnership between UCSB Library and Crimson Hexagon. Before downloading, the data was queried to meet the following conditions:
Accessing Data
Crimson Hexagon only allows 10,000 randomly selected tweets to be exported at a time, manually, in .xls format. Because of this restriction, data was downloaded manually in 2-day windows in order to capture all tweets. On average, around 5,000 tweets per day met these conditions.
The Crimson Hexagon data did not contain all desired information, including whether or not the tweet was geotagged. To get this information we used the Python twarc library to “rehydrate” the data using individual tweet IDs and stored the tweet information as .json files. From here we were able to remove all tweets that did not have a geotag, giving us a total of 79,981 tweets (including Jan-Apr 2015).
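The filtering step can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: it assumes the hydrated output is line-delimited JSON and that a "geotag" means a non-null `coordinates` or `place` field, as in Twitter API v1.1 tweet objects. The function names and sample records are hypothetical.

```python
import json


def has_geotag(tweet: dict) -> bool:
    """Return True if the hydrated tweet carries location information.

    Assumption: a geotag is a non-null `coordinates` or `place` field.
    """
    return tweet.get("coordinates") is not None or tweet.get("place") is not None


def filter_geotagged(jsonl_lines):
    """Parse line-delimited JSON and keep only geotagged tweets."""
    tweets = (json.loads(line) for line in jsonl_lines if line.strip())
    return [t for t in tweets if has_geotag(t)]


# Hypothetical sample records, not real tweets.
sample = [
    '{"id_str": "1", "coordinates": {"type": "Point", "coordinates": [-119.714, 34.4258]}, "place": null}',
    '{"id_str": "2", "coordinates": null, "place": null}',
]
kept = filter_geotagged(sample)
print([t["id_str"] for t in kept])
```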
| Month | Day | Time | Year | full_text | user_location | retweet_count | favorite_count | month_num | date |
|---|---|---|---|---|---|---|---|---|---|
| Nov | 24 | 22:15:14 | 2015 | Another great short by @kabtweet about #recycling Watch it here https://t.co/Nkue60fgGn https://t.co/WEIrl1ZVLl | Santa Barbara, CA | 0 | 0 | 11 | 2015-11-24 |
| Apr | 24 | 05:29:53 | 2016 | Duck Duck Moose @ The Nugget https://t.co/ixAn9AOuNL | NA | 0 | 0 | 4 | 2016-04-24 |
| Aug | 29 | 13:44:22 | 2016 | current weather in Santa Barbara: fog, 61°F 87% humidity, wind 11mph, pressure 1014mb | Santa Barbara, CA | 0 | 0 | 8 | 2016-08-29 |
| Dec | 20 | 19:39:57 | 2017 | Still looking for that perfect gift for someone on your list? Give the gift of cider and treat… https://t.co/ugtY1nDNmu | Santa Barbara, CA | 0 | 0 | 12 | 2017-12-20 |
| Oct | 13 | 21:14:25 | 2018 | OK, OK…so not only is the @santabarbaramag #Fall issue cover completely captivating, but if you squint at the story topic at the very bottom of the cover - “Sip + Savor at New… https://t.co/PqDFZYveVn | Los Angeles & Santa Ynez Val | 0 | 1 | 10 | 2018-10-13 |
| Jan | 24 | 01:41:03 | 2016 | She’s a big gurl who is blessed to do what she does. #dragqueen #dragqueens #drag #dragshow… https://t.co/IMsrsf3B1W | Los Angeles, CA | 0 | 1 | 1 | 2016-01-24 |
| Feb | 26 | 01:48:53 | 2017 | James Connolly, the Banjo Man! #banjo #livemusic #m8rxsb @ M8RX Nightclub & Lounge https://t.co/j0LAYMnXNu | Santa Barbara, CA | 0 | 1 | 2 | 2017-02-26 |
The number of geotagged tweets is going down over time. There is a significant drop in tweets at the end of April 2015. It seems this is due to “a change in Twitter’s ‘post Tweet’ user-interface design results in fewer Tweets being geo-tagged” (source). The first 4 months of 2015 have 15,720 tweets, or roughly 19% of all tweets. To reduce skew in the data and remove tweets that may have been geotagged without the user’s knowledge in those months, we moved forward with all tweets from May 1, 2015 through the end of 2019.
Given that the tweet dataset is limited to geotagged tweets, I hypothesize that most of these tweets have a picture or a link to an Instagram post. We can detect links by looking for “t.co” in the tweet, which is Twitter’s shortened URL for a separate webpage. These are often Twitter or Instagram photos, but we can’t be 100% certain.
It looks like 93% of geotagged tweets contain a link or picture.
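A link check of this kind might look like the sketch below. It searches for `t.co/` rather than the bare `t.co` so that words such as "chat.com" are not counted; that stricter pattern is an assumption of this example, and the helper names are hypothetical.

```python
def contains_link(text: str) -> bool:
    """True if the tweet text includes a t.co shortened URL."""
    return "t.co/" in text.lower()


def share_with_links(texts) -> float:
    """Fraction of tweets whose text contains a t.co link."""
    texts = list(texts)
    if not texts:
        return 0.0
    return sum(contains_link(t) for t in texts) / len(texts)


# Two tweets taken from the sample table above.
examples = [
    "Duck Duck Moose @ The Nugget https://t.co/ixAn9AOuNL",
    "current weather in Santa Barbara: fog, 61°F 87% humidity, wind 11mph, pressure 1014mb",
]
share = share_with_links(examples)
print(f"{share:.0%} of example tweets contain a link")
```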
The spatial distribution of tweets highlights areas of higher population density and tourist areas in downtown Santa Barbara.
There is a single coordinate that has over 11,000 tweets reported across all years. It is near De La Vina between Islay and Valerio. There is nothing remarkable about this site so I assume it is the default coordinate when people tag “Santa Barbara” generally. The coordinate is 34.4258, -119.714.
As you zoom in on the map, clusters will disaggregate. You can click on blue points to see the tweet.
Each hexagon shows the log10 density of tweets in that area. The highest number of tweets in a single location is around 11,000 (deep purple hex). This hex contains the default coordinate assigned when a user tags the city of Santa Barbara without a precise location.
This project aims to understand if and how preferences differ between tourists and locals for nature-based places within the Santa Barbara area. To test this, we needed a way to identify tourists and locals. We used a two-step process.
First, if the user has self-identified their location as somewhere in the Santa Barbara area, they are designated a local. This includes Carpinteria, Santa Barbara, Montecito, Goleta, Gaviota and UCSB. For the remainder, we use the number of times they have tweeted from Santa Barbara within a year to designate user type. If someone has tweeted across more than 2 months in the same year from Santa Barbara, they are identified as a local. This is consistent with how Eric Fischer determined tourists in his work. This is not foolproof, and there are instances where people visit and tweet from Santa Barbara more than two months a year, especially if they are visiting family or live within a couple of hours’ driving distance.
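The two-step rule can be sketched as below. This is an illustrative reading of the description above, not the project's code: the input tuple layout, the lowercase substring matching on `user_location`, and all names are assumptions.

```python
from collections import defaultdict

# Place names the text lists as "local"; matched as lowercase substrings (assumption).
LOCAL_PLACES = ("carpinteria", "santa barbara", "montecito", "goleta", "gaviota", "ucsb")


def classify_users(tweets, max_tourist_months=2):
    """Label each user 'local' or 'tourist'.

    tweets: iterable of (user_id, user_location, year, month) tuples.
    """
    users = set()
    self_declared = set()
    months_by_user_year = defaultdict(set)
    for user_id, location, year, month in tweets:
        users.add(user_id)
        # Step 1: self-identified location within the Santa Barbara area.
        if location and any(place in location.lower() for place in LOCAL_PLACES):
            self_declared.add(user_id)
        months_by_user_year[(user_id, year)].add(month)

    # Step 2: tweeted in more than `max_tourist_months` distinct months of one year.
    frequent = {
        user_id
        for (user_id, _year), months in months_by_user_year.items()
        if len(months) > max_tourist_months
    }
    return {
        u: "local" if (u in self_declared or u in frequent) else "tourist"
        for u in users
    }


# Hypothetical users for illustration.
sample = [
    ("a", "Santa Barbara, CA", 2016, 8),
    ("b", "Los Angeles, CA", 2016, 1),
    ("b", "Los Angeles, CA", 2016, 3),
    ("b", "Los Angeles, CA", 2016, 7),
    ("c", None, 2017, 5),
    ("c", None, 2018, 5),
]
labels = classify_users(sample)
print(labels)
```

User `b` is labeled a local despite the Los Angeles profile location, which is exactly the kind of frequent visitor the caveat above describes.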
There are 21,811 tweets from tourists and 45,420 tweets from locals (32% and 68%). There are 12,460 unique tourists and just 1,893 unique local users.
The following map shows areas that have more tweets from locals (orange) or tourists (purple). Note that the values are the log10 of the absolute difference between the number of tweets from each user group. If a hex is purple and has a value of 2, there are 100 more tweets from tourists than from locals at that location.
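As a worked example of the hex value, the sketch below computes the dominant group and the log10 of the absolute difference for one hex. How hexes with equal counts were handled is not stated in the text; returning no value for them is an assumption here, as is the function name.

```python
import math


def signed_log_difference(n_local: int, n_tourist: int):
    """Return (dominant_group, log10 of the absolute difference) for one hex.

    Hexes with equal counts return (None, None), since log10(0) is undefined.
    """
    diff = n_tourist - n_local
    if diff == 0:
        return None, None
    group = "tourist" if diff > 0 else "local"
    return group, math.log10(abs(diff))


# 150 tourist tweets vs 50 local tweets: a difference of 100, so the value is 2.
group, value = signed_log_difference(n_local=50, n_tourist=150)
print(group, value)
```

Because the value is based on a difference rather than a ratio, a hex with 1,100 tourist tweets and 1,000 local tweets gets the same value as this one.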
The full text of each tweet was classified as either nature-based or not. We developed a coarse dictionary of words that indicate a nature-based tweet. These include natural features like ocean, coast, and park, and words that indicate recreation (fishing, hiking, camping, etc.).
## [1] "hike" "trail" "hiking" "camping" "tent"
## [6] "climb" "summit" "fishing" "sail" "sailing"
## [11] "boat" "boating" "ship" "cruise" "cruising"
## [16] "bike" "biking" "dive" "diving" "surf"
## [21] "surfing" "paddle" "swim" "ocean" "beach"
## [26] "[^a-z]sea" "sand" "coast" "island" "wave"
## [31] "fish" "whale" "dolphin" "pacific" "crab"
## [36] "lobster" "water" "shore" "marine" "seawater"
## [41] "lagoon" "slough" "saltwater" "underwater" "tide"
## [46] "aquatic" "[^a-z]tree" "[^a-z]earth" "weather" "sunset"
## [51] "sunrise" "[^a-z]sun" "climate" "park" "wildlife"
## [56] "[^a-z]view" "habitat" "[^a-z]rock" "nature" "mountains"
## [61] "[^a-z]peak" "canyon" "pier" "wharf" "environment"
## [66] "ecosystem" "flower"
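Several dictionary entries are regular-expression fragments (for example `[^a-z]sea`), so matching is by pattern rather than by whole word. The sketch below shows one way such matching could work, using a small subset of the dictionary on lowercased text. It is an illustration under those assumptions, not the project's code, and the function name is hypothetical.

```python
import re

# Subset of the dictionary above; each entry is treated as a regex fragment.
NATURE_PATTERNS = ["hike", "trail", "tent", "surf", "ocean", "beach", "[^a-z]sea", "[^a-z]rock"]
NATURE_RE = re.compile("|".join(NATURE_PATTERNS))


def is_nature_based(text: str) -> bool:
    """True if any dictionary pattern occurs anywhere in the lowercased text."""
    return NATURE_RE.search(text.lower()) is not None


print(is_nature_based("life is good today @ carpinteria state beach"))  # matches "beach"
print(is_nature_based("duck duck moose @ the nugget"))                  # no pattern matches
print(is_nature_based("tis the season for movies"))                     # " sea" inside " season"
print(is_nature_based("sea lions at the harbor"))                       # nothing precedes "sea"
```

Substring matching is coarse: "season" triggers the `[^a-z]sea` pattern, while a tweet that begins with "sea" does not, because `[^a-z]` requires a preceding character.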
Let’s look at some examples of tweets that qualified as “nature-based”.
| date | full_text | user_location | user_type | nature_word |
|---|---|---|---|---|
| 2016-02-06 | life is good today ☀️🌴🌊 @ carpinteria state beach https://t.co/pt7dzcjo21 | NA | tourist | 1 |
| 2018-09-17 | our 6th graders woke up super early with excitement to board their bus headed toward their cimi adventures. a week full of ocean fun! #cimi #marymountofsantabarbara #tripweek @ marymount… https://t.co/a057dkiqxv | Santa Barbara, CA | local | 1 |
| 2018-12-05 | 🍿 ‘tis the season! @shoppaseonuevo + @metrotheatres want to help you #surprise the movie 🎥 lover in your life irl! #tag 🏷 someone you would like to #give the #gift of #movies to in the… https://t.co/llyt1iertt | Santa Barbara, CA | local | 1 |
| 2016-05-11 | she needs the water | maddzmo #vintage #aloha #nikon #california #santabarbara #sandy… https://t.co/jzgapq8yi9 | Ventura, CA | local | 1 |
| 2016-07-02 | flight of the conchords. #rocks droll @ santa barbara bowl https://t.co/fzcwnefogw | Concord, CA | tourist | 1 |
| 2015-08-16 | 👆👆 check out our “non-tent” setup last night!!! it was so stunning!!! @caid805 #damwhittheyremarried… https://t.co/y57cb1g53l | So Cal & Destination Events | local | 1 |
| 2017-12-07 | #santabarbara #thomasfire @ inn at east beach https://t.co/dlhbxrprdi | NA | local | 1 |
All groups show increases in proportion of tweets that are nature based over time, even as the number of geotagged tweets declines.
Not surprisingly, there are fewer nature-based tweets than non-nature-based ones: 24% of all geotagged tweets are nature-based.
Of local tweeters, 21% of tweets are nature-based. Of tourists, 30% are nature-based.
To link tweet locations to what exists at those locations we need to use a spatial dataset that tells us what is there. This could be roads, city parcel information, or in our case we are using protected areas from the California Protected Areas Database.
The CPAD is a GIS dataset depicting lands that are owned in fee and protected for open space purposes by over 1,000 public agencies or non-profit organizations.
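Linking a tweet to a protected area is a point-in-polygon test. The sketch below uses a plain ray-casting check on simple polygons to show the idea. A real workflow would more likely use a GIS library and would need to handle holes and multi-part polygons. The area name and coordinates are hypothetical and are not taken from CPAD.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_area(lon, lat, areas):
    """Return the name of the first area containing the point, or None."""
    for name, polygon in areas.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


# Hypothetical rectangular area, for illustration only.
areas = {
    "Hypothetical Park": [(-119.70, 34.40), (-119.68, 34.40), (-119.68, 34.42), (-119.70, 34.42)],
}
print(assign_area(-119.69, 34.41, areas))
print(assign_area(-119.714, 34.4258, areas))
```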
We can look at the top 20 most popular tweeted-from sites. The green highlighted portion represents nature-based tweets. The number indicates what percentage of all tweets are nature-based at each site. Names in bold indicate over 50% of tweets are nature-based.
Here we compare the proportion of nature-based tweets from CPAD areas by both tourists and locals. Sites that are above the line have a higher proportion of nature-based tweets from tourists, those below see more from locals. The size of the point indicates the total number of tweets from those locations. There isn’t a significant difference between the two user groups, indicating that these sites see a similar proportion of nature-based tweets from both groups.
Going a bit further, we also looked at number of unique visitors to these CPAD sites. By calculating the proportion of unique tourists and locals that visit these sites, we start to look at who goes where. This is not limiting tweets to only those that are nature-based.
At the lower end we see more locals than tourists visiting these sites. These tend to be less popular areas. On the upper end, we see sites that are more frequented overall, and more frequented by tourists. These include well-known areas like the Santa Barbara Harbor and Stearns Wharf. Those on the lower end that locals frequent more are either lesser-known (Shoreline Park and Alameda Park are both neighborhood parks) or further from main tourist areas (e.g. Goleta Beach).
We have 3846 unique users from within CPAD areas, representing 27% of all unique users in the dataset.
Using a Term Frequency-Inverse Document Frequency (TF-IDF) analysis, we can identify words within tweets that are not simply the most common (e.g. “the”, “to”, “santa barbara”) but the most “important”. TF-IDF is a measure of how important a word is to a document in a corpus of documents, or in this case how important a word is to all nature-based tweets.
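To make the measure concrete, here is a minimal sketch of one common TF-IDF variant: term frequency within a document multiplied by the natural log of (number of documents / documents containing the term). The exact variant used in this analysis is not stated, and treating each site as one "document" is an assumption of this example. The token lists are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: dict of name -> list of tokens. Returns name -> {term: score}."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))  # count each term once per document

    scores = {}
    for name, tokens in documents.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


# Hypothetical token lists, one per site.
docs = {
    "harbor": ["santa", "barbara", "boat", "boat", "sunset"],
    "trailhead": ["santa", "barbara", "hike", "trail", "sunset"],
    "beach": ["santa", "barbara", "sand", "wave"],
}
scores = tf_idf(docs)
top_harbor = max(scores["harbor"], key=scores["harbor"].get)
print(top_harbor, scores["harbor"]["santa"])
```

Words that appear in every document, such as "santa" here, score zero, which is how the measure discounts terms that are common everywhere.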
Most important words across all nature-based tweets in Santa Barbara
Most important words for a select number of CPAD areas
We can apply sentiment analysis to the Twitter data to try to understand patterns and trends in the general sentiment of tweets.
The top graph shows the total number of geotagged tweets, which has gone down over time across tourists and locals.
The bottom graph shows average daily sentiment scores over time. Above 0 is positive, below 0 is negative. We see that tweets are mostly positive and that average sentiment is increasing over time.
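An average daily sentiment score of this kind can be sketched with a word-level lexicon: score each tweet by summing the values of its words, then average per day. The lexicon below is a toy stand-in invented for illustration; the text does not say which lexicon or scoring scheme was used, and the sample tweets are hypothetical.

```python
from collections import defaultdict

# Toy lexicon for illustration only; a real analysis would use a published lexicon.
TOY_LEXICON = {"good": 1, "great": 1, "stunning": 1, "love": 1, "bad": -1, "sad": -1, "terrible": -1}


def tweet_score(text: str) -> int:
    """Sum of lexicon values over the words of one tweet."""
    words = text.lower().split()
    return sum(TOY_LEXICON.get(w.strip(".,!?#@"), 0) for w in words)


def daily_mean_sentiment(tweets):
    """tweets: iterable of (date_string, text). Returns {date: mean score}."""
    by_day = defaultdict(list)
    for date, text in tweets:
        by_day[date].append(tweet_score(text))
    return {day: sum(scores) / len(scores) for day, scores in by_day.items()}


# Hypothetical tweets.
sample = [
    ("2016-02-06", "life is good today"),
    ("2016-02-06", "great sunset, love it!"),
    ("2017-12-07", "sad and terrible traffic"),
]
daily = daily_mean_sentiment(sample)
print(daily)
```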
Examining scale
Applying the same or similar method to other regions of different geographic and population sizes could reveal more interesting information and provide context for the patterns and trends we see in Santa Barbara.
Is Santa Barbara unique in that:
- tourists and locals have similar spatial patterns
- 24% of all geo-tagged tweets are nature-based
- the proportion of nature-based tweets is increasing as geotagged tweets decrease overall, and positive sentiment is increasing over time
We might expect the tourist/local alignment to diverge in highly urban areas (LA, San Francisco), to be stronger in other suburban areas (e.g. Santa Cruz), and perhaps not to exist in rural locations.
If we look at the proportion of tweets that are nature-based across these rural-suburban-urban scales, we may reveal where sentiments or Sense of Place around the natural environment are higher or lower. For example, we would expect a lower proportion of nature-based tweets in New York compared to Santa Barbara. We could also compare the city to the state level. Across all geotagged tweets in California, what is the proportion of nature-based tweets?
If this method is replicated going forward, there are a few areas where the approach could be refined and where better data would help.
Identifying tourists and locals
If we had access to a larger Twitter dataset, we could identify where tourists are “from” (or where they tweet more consistently) to confirm their tourist status, instead of relying on the number of months a user tweets within an area.
Nature-based dictionary
The dictionary compiled for this project was based solely on my own perspective of nature-based words. It also leaned heavily on what I would expect people to tweet about in Santa Barbara (e.g. “lobster”, “islands”, “wharf”). Ideally a dictionary used to identify nature-based tweets would be developed using more robust methods across a more geographically representative area.
Spatial data for natural areas
The CPAD dataset is good but not perfect. Some place names needed to be edited and some polygons needed to be fixed. This would not have been possible without extensive local knowledge of Santa Barbara. To scale this analysis to larger areas, you would want to ensure the underlying “natural area” dataset is appropriate.
Bias in data
There is inherent bias in using social media data to draw broader conclusions about a community. Not everyone has access to social media or uses social media in a similar manner. There are differences across all demographics (genders, ages, ethnicities, economic status); these were not taken into consideration during this project but should be considered if this work is expanded upon. There are also differences in who decides to make their account public and explicitly chooses to geotag their tweets (Sloan & Morgan 2015).
Goals
Proposed approach
Additional figures to supplement the analysis.
Threshold for defining tourists/locals
If we just look at proportion of nature-based tweets we see a different ordering. I removed any places with just 1 tweet since it will skew results if that tweet happens to be nature-based (a total of 4 places).
This chart shows all CPAD areas and the proportion of tweets that are nature-based. The total number of tweets is represented by the width of the line.
The highest ratio of nature tweets to non-nature takes place at Lookout Park and Beach.
The idea here is to use the data to identify places where the majority of tweet content is nature-based but the location does not fall within a designated area. This could be used to indicate places that maybe should be recognized or protected but currently aren’t.
The top 10 positive and negative words found across all tweets. There are many more instances of the positive words than negative.
We see that “joy” and “positive” are generally the most common sentiment categories among tweets.
Top 100 words across all Santa Barbara geotagged tweets